What Are The Factors Influencing Criminality? Evidence from US

Gasparri, Eleonora and Mezzini, Lorenzo

14 December 2020


Introduction

Overview and Motivation

When thinking about welfare and government expenditure on health, we often focus on physical health and much less on mental health. When we started this project, our main idea was to study the relationship between mental health services and homicides. Later, we moved to a broader view and studied which factors have the largest impact on criminality in a developed country.

After looking for relevant data-sets on the internet, we decided to concentrate on the United States of America and to collect data for each state.

The main motivation behind our project is our interest in social sciences and policies. We chose this topic because our results could be of interest to policy makers deciding on expenditure on, and provision of, education and mental health services, as well as on other factors.

Research questions

The research questions we will try to answer throughout our project are:

  • Is there any relationship between expenditure for mental health by the government and criminality?
  • Is the level of education and wealth (through GDP) of a State relevant for its level of criminality?
  • Is the composition of the population, in terms of both age and ethnicity, relevant for criminality in the area?
  • Is mental health expenditure affected by how educated the population is or by the GDP of the State?

The answers to these questions could lead a reader to ask how such correlations might be exploited to lower the level of criminality. That consideration makes sense, however, only if we are able to find significant relationships between the variables.

Data

In this part we present the data we use to answer our research questions. We start by importing and cleaning them. We use a total of 17 files to form our final data-set and present them below separately, grouped by theme. Notice that we clean each data-set so that they all appear “standardized” (for example, we select the years from 2004 to 2013, since only in this time frame are all data available). This is done to ease the join process among all data-sets.

Crime Data

Sources and Description

We used the data-set on estimated crimes available on the FBI website. It contains repeated observations for each state in the United States of America from 1979 to 2019.

Raw data - Estimated Crimes 1979-2019 in US

The data-set contains 2116 observations for 15 variables, which are

  • year the year of the observation
  • state_abbr, state_name the abbreviation and the name of the State. Notice that in the first line of each year these two fields are blank. These observations refer to the total, i.e. United States.
  • population the population of a given State in a given year.
  • violent_crime, homicide, rape_legacy, rape_revised, robbery, aggravated_assault, property_crime, burglary, larceny, motor_vehicle_theft, caveats each refers to the respective number of crimes registered by the FBI, or to caveats. From the source we learn that homicides resulting from the events of September 11, 2001, are not included. This is fine for us: they would be an outlier that is not relevant for our analysis, and 2001 falls outside the period we study.

Missing values for each feature

Feature               Missing values
year                  0
state_abbr            41
state_name            41
population            0
violent_crime         0
homicide              0
rape_legacy           156
rape_revised          2116
robbery               0
aggravated_assault    0
property_crime        0
burglary              0
larceny               0
motor_vehicle_theft   0
caveats               2045

The table above shows how many NAs we have for each feature. On this basis we decide not to take rape_revised and caveats into account. We already know that the missing values of state_name and state_abbr refer to the United States, so we will fill them appropriately.

Wrangling/cleaning

To clean this data-set we transform the year values into numeric. Moreover, we rename the column state_name to State and select only the crimes which we think are most relevant for our study and most likely to be affected by mental health expenditure. We also replace the NAs in State and state_abbr with “United States” and “US”.
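
The cleaning steps just described can be sketched as follows. This is only an illustration in plain Python, not the code used for the project: the sample rows, the KEEP tuple and the function name clean_crime_rows are hypothetical, and all numbers are made up.

```python
# Hypothetical raw rows shaped like the FBI file; all numbers are made up.
raw_rows = [
    {"year": "2003", "state_abbr": "", "state_name": "", "population": "1000", "homicide": "10"},
    {"year": "2004", "state_abbr": "", "state_name": "", "population": "1100", "homicide": "12"},
    {"year": "2004", "state_abbr": "AL", "state_name": "Alabama", "population": "300", "homicide": "3"},
    {"year": "2014", "state_abbr": "AL", "state_name": "Alabama", "population": "320", "homicide": "4"},
]

# Illustrative subset of the numeric columns that are kept.
KEEP = ("population", "homicide")


def clean_crime_rows(rows, first_year=2004, last_year=2013):
    """Numeric year, state_name renamed to State, blanks filled, years filtered."""
    cleaned = []
    for row in rows:
        year = int(row["year"])  # the year arrives as text
        if not first_year <= year <= last_year:
            continue
        record = {
            "year": year,
            # A blank state marks the national total.
            "State": row["state_name"] or "United States",
            "state_abbr": row["state_abbr"] or "US",
        }
        record.update({col: int(row[col]) for col in KEEP})
        cleaned.append(record)
    return cleaned


estimated_crimes = clean_crime_rows(raw_rows)
```

Only the two 2004 rows survive the year filter, and the row with blank state fields is labelled as the national total.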

The cleaned dataset is called “estimated_crimes” and is reported below:

Cleaned data - Estimated Crimes 2004-2013 in US

Mental Health Expenditure Data

Sources and Description

For this part we had to download a separate data-set for each year from 2004 to 2013; you can find them at this link.

Since the structure for each year’s data-set is the same we report only the first one, for year 2004:

Raw data on Mental Health Expenditure per capita

The data-set for each year contains 51 observations for 3 variables, which are

  • Location the State or US
  • SMHA Expenditures Per Capita the State Mental Health Agency (SMHA) expenditure on mental health per capita in each state
  • population the population of a given State in a given year.
  • footnotes notes on the data, for example that the reporting period reflects spending in the state fiscal year, which may vary by state, that data are not adjusted for inflation, and that Puerto Rico is included in the US total.

Wrangling/cleaning

We cleaned the data-set for each year and then joined them. To clean them we remove the dollar sign $ from the expenditure per capita values and transform them into numeric. We also rename the expenditure per capita column to the year of the data-set, e.g. 2004. This eases the join process, which is done by State, the renamed Location column.
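
A minimal sketch of these steps in plain Python, under the assumption that each yearly file has been read into a list of dictionaries. The helper names parse_dollars, clean_year and join_years are hypothetical and the values are made up.

```python
def parse_dollars(text):
    """Turn a string such as '$98.40' into a float; blanks become None."""
    text = text.strip().replace("$", "").replace(",", "")
    return float(text) if text else None


def clean_year(rows):
    """Map State -> expenditure for one yearly file (Location renamed to State)."""
    return {row["Location"]: parse_dollars(row["SMHA Expenditures Per Capita"])
            for row in rows}


def join_years(per_year):
    """Join the yearly tables by State; per_year maps year -> {State: value}."""
    states = sorted({state for table in per_year.values() for state in table})
    return [
        {"State": state, **{str(year): per_year[year].get(state) for year in sorted(per_year)}}
        for state in states
    ]


# Made-up values for two locations and two years.
raw_2004 = [{"Location": "Alabama", "SMHA Expenditures Per Capita": "$60.10"},
            {"Location": "Puerto Rico", "SMHA Expenditures Per Capita": ""}]
raw_2005 = [{"Location": "Alabama", "SMHA Expenditures Per Capita": "$62.50"},
            {"Location": "Puerto Rico", "SMHA Expenditures Per Capita": "$21.00"}]

mh_exp = join_years({2004: clean_year(raw_2004), 2005: clean_year(raw_2005)})
```

Each row of mh_exp holds one State and one column per year; a blank cell in a yearly file stays missing (None) after the join.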

The resulting dataset on mental health is “mh_exp” and is reported below:

Cleaned data - Mental Health Expenditure Per Capita, 2004-2013

We can also look at how many NAs are present:

Missing values for each feature

There are some, and if you look at the data you can see that the missing values usually come from Puerto Rico’s observations.

US demographic Data

Sources and Description

We used three data-sets from the United States Census Bureau’s website.

The first one is about race composition from 2000 to 2010:

Raw data - Demographics, Race, 2000-2010

This one contains 364 observations for 18 variables, which are

  • REGION, DIVISION, STATE, NAME which identify the region, division, state code and name of the state.
  • RACE is the race; it goes from 0 to 6, with 0 being the total and 1-6 identifying ethnicities such as White, Black/African-American and so on
  • POPBASE2000 and POPESTIMATEyear for each year, which are the estimated population in a State in a given year, for the respective race category

The second one is about age and sex composition from 2000 to 2010:

Raw data - Demographics, Age & Sex, 2000-2010

This one contains 13572 observations for 19 variables, which are

  • REGION, DIVISION, STATE, NAME which identify the region, division, state code and name of the state.
  • SEX is the sex; it can be either 0 (total), 1 (male) or 2 (female)
  • AGE is the age; it goes from 0 to 85 years old, and there is also 999, which is the total population
  • POPBASE2000 and POPESTIMATEyear for each year, which are the estimated population in a State in a given year, for the respective sex and age category

The third one is about race, age and sex composition from 2010 to 2019:

Raw data - Demographics, Race, Age & Sex, 2010-2019

This one contains 236844 observations for 21 variables, which are

  • SUMLEV is the identification of the summary levels used by the census; it is also called “area type”
  • REGION, DIVISION, STATE, NAME which identify the region, division, state code and name of the state.
  • RACE is the race; it goes from 1 to 6, identifying ethnicities such as White, Black/African-American and so on
  • ORIGIN is the origin; it can be 0 (total), 1 (Not Hispanic) or 2 (Hispanic). This information is absent between 2000 and 2009, so we will omit it
  • SEX is the sex; it can be either 0 (total), 1 (male) or 2 (female)
  • AGE is the age; it goes from 0 to 84 years old, and 85 comprises people aged 85 and older
  • POPBASE2010 and POPESTIMATEyear for each year, which are the estimated population in a State in a given year, for the respective race, sex and age category

Wrangling/cleaning

The cleaning is done for each data-set separately. Later we proceed to join them. In all data-sets we transform REGION into a factor and rename the levels such that total US replaces 0, North-East (NE) replaces 1, Mid-West (MW) replaces 2, South (S) replaces 3, and West (W) replaces 4.

RACE and SEX also become factors, with level labels (White=1, BlackAfricanAmerican=2, AmericanIndianAlaska=3, Asian=4, HawaiianPacificIslanders=5, Racegreaterthan1=6, and Total=0 for the data-set covering 2000-2010) and (Total=0, Male=1, Female=2), respectively.

Names of variables are also changed slightly to bring them in line with the other data-sets, and through pivot_longer and pivot_wider we put the table into a standardized structure.

Moreover, for AGE we created sub-groups instead of keeping the complete range 0-85 years old. The age groups are 0-17, 18-24, 25-44, 45-64, 65-84 and 85+. We do not yet know whether age composition has an impact on criminality, but we consider it important to have the subgroups 18-24 and 25-44 since, as we will see, the education data refer to these two age groups.
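
The grouping of single years of age can be sketched as below. This is a plain Python illustration rather than the project's code; the function names are hypothetical, while the group labels follow the variable names used later in the report (Age_0_17 … Age_over85).

```python
AGE_GROUPS = [(0, 17, "Age_0_17"), (18, 24, "Age_18_24"), (25, 44, "Age_25_44"),
              (45, 64, "Age_45_64"), (65, 84, "Age_65_84")]


def age_group(age):
    """Map a single year of age to its group label; 85 stands for 85 and older."""
    for low, high, label in AGE_GROUPS:
        if low <= age <= high:
            return label
    if age >= 85:
        return "Age_over85"
    raise ValueError(f"unexpected age: {age}")


def population_by_group(counts_by_age):
    """Sum population counts over single ages into the six age groups."""
    totals = {}
    for age, count in counts_by_age.items():
        if age == 999:  # total row in the 2000-2010 file, not an age
            continue
        label = age_group(age)
        totals[label] = totals.get(label, 0) + count
    return totals
```

The code 999 is skipped explicitly because it marks the total population in the 2000-2010 file and would otherwise be counted in the oldest group.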

The race groups are also different in the cleaned data-set: White, BlackAfricanAmerican, Asian and Other_race. The latter comprises all the other categories. We also filter for the years of interest (2004-2013).

For 2011-2013 the observation for the United States is missing, whereas it is present for 2004-2010. Therefore, we created it by taking the sum across states, since the US values are the total, and we added it to the state-level data for 2011-2013.
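
As an illustration, the national rows can be rebuilt by summing the state rows year by year. The sketch below is plain Python with made-up counts; add_us_totals is a hypothetical helper and it assumes the input holds state-level rows only.

```python
from collections import defaultdict


def add_us_totals(state_rows, value_cols):
    """Append one 'United States' row per year, summing value_cols across states."""
    totals = defaultdict(lambda: dict.fromkeys(value_cols, 0))
    for row in state_rows:
        for col in value_cols:
            totals[row["year"]][col] += row[col]
    us_rows = [{"State": "United States", "year": year, **sums}
               for year, sums in sorted(totals.items())]
    return state_rows + us_rows


# Made-up counts for two states in 2011.
rows_2011 = [
    {"State": "Alabama", "year": 2011, "White": 70, "Asian": 5},
    {"State": "Alaska", "year": 2011, "White": 30, "Asian": 2},
]
with_total = add_us_totals(rows_2011, ["White", "Asian"])
```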

We end up with two data-sets on demographics, one for 2004-2010 and the other for 2011-2013. Finally, we join them to obtain the final “demographics” data-set:

Cleaned data - Demographics in US

No missing value is present. Notice, however, that in the cleaning process we have to fill some Region values which would otherwise be missing. Knowing the data-set and the State, this is straightforward.

Education Data

Source and Descriptions

For education we decided to use a proxy: the incidence of Bachelor’s degrees in the population. We found two data-sets, one for the percentage of people aged 25-44 with a Bachelor’s degree, for years 2005-2018, and one for the number of Bachelor’s degrees conferred per 1000 individuals aged 18-24, for years 2000-2018.

The former is:

Raw data - %25-44 years old people with a Bachelor’s Degree

This one contains 53 observations for 15 variables, which are

  • State which identifies the state, or the whole US
  • 2005 … 2018 one column for each year observed

Raw data - Per 1000 18-24 years old people conferring a Bachelor’s Degree

This one contains 53 observations for 20 variables, which are

  • State which identifies the state, or the whole US
  • 2000 … 2018 one column for each year observed

Wrangling/Cleaning

Both data-sets are cleaned separately and then put together. The main task in both is to create a new variable year and a value variable, respectively perc_bscholder_25_44 and perc_bscconferred_18_24, which results in longer data-sets.

Notice that the data in the second data-set are expressed per 1000 people and not in percentage terms, as perc_bscholder_25_44 is. Therefore, to obtain perc_bscconferred_18_24 we divide the data by 1000 and multiply by 100.

It is also worth mentioning that the data-set describing the percentage of people aged 25-44 with a Bachelor’s degree lacks observations for 2004. To adjust for this we first create these observations as NAs, then fill them with the value from 2005. In our opinion this should not alter our analysis, because the difference from year to year is relatively small.
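
The unit conversion and the filling of 2004 can be sketched as follows, in plain Python and with made-up values. The function names are hypothetical; the sketch assumes the data are already in long format, with one row per State and year.

```python
def per_thousand_to_percent(value):
    """x per 1000 people is x / 1000 * 100 percent."""
    return value / 1000 * 100


def backfill_2004(long_rows, value_name):
    """Add a 2004 row per State that copies the 2005 value."""
    filled = [
        {"State": row["State"], "year": 2004, value_name: row[value_name]}
        for row in long_rows
        if row["year"] == 2005
    ]
    return sorted(filled + long_rows, key=lambda row: (row["State"], row["year"]))


# Made-up percentages for one state.
holders = [
    {"State": "Alabama", "year": 2005, "perc_bscholder_25_44": 24.0},
    {"State": "Alabama", "year": 2006, "perc_bscholder_25_44": 24.5},
]
holders_filled = backfill_2004(holders, "perc_bscholder_25_44")
```

Dividing by 1000 and multiplying by 100 is the same as dividing by 10, so a value of 55 per 1000 corresponds to 5.5%.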

Then, we join the two cleaned data-sets in “edu”:

Cleaned data - Proxies for Education level

There are no missing values in this data-set, but remember that the ones we had in edu_percholder_25_44 have been filled with the 2005 values.

GDP Data

Source and Descriptions

The data-set on GDP can be found on the website of the Bureau of Economic Analysis of the U.S. Department of Commerce.

Raw Data - Dataset which gives us info on GDP, our variable of interest

This has 483 observations for 27 columns, which are:

  • GeoFips and GeoName which identify the state, or the whole US, through a code and the name, respectively
  • LineCode and Description which identify the kind of variable we are looking at, e.g. Current-dollar GDP (millions of current dollars), Real GDP (millions of chained 2012 dollars), etc. There are 8 different kinds, but we will focus on Current-dollar GDP, as you will see
  • 1997 … 2019 one column for each year observed

Wrangling/Cleaning

To clean the data-set we filter for one value of Description only, the one of interest for us: Current-dollar GDP (millions of current dollars). We delete the columns which are not relevant, keeping GeoName, renamed to State, and 1997 … 2019. We use pivot_longer to create two variables, year and Current_dollar_GDP_millions, which increases the length of the data-set. Of course, we also filter for the years 2004-2013.
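
The reshaping from one column per year to one row per State and year can be sketched as below. This is a plain Python stand-in for the pivot_longer step, not the project's code: clean_gdp is a hypothetical name, and the sample values, including the GeoFips and LineCode entries, are made up.

```python
TARGET = "Current-dollar GDP (millions of current dollars)"


def clean_gdp(raw_rows, first_year=2004, last_year=2013):
    """Keep one Description, rename GeoName to State, and go from wide to long."""
    long_rows = []
    for row in raw_rows:
        if row["Description"] != TARGET:
            continue
        for col, value in row.items():
            if not col.isdigit():  # skip GeoFips, GeoName, LineCode, Description
                continue
            year = int(col)
            if first_year <= year <= last_year:
                long_rows.append({"State": row["GeoName"], "year": year,
                                  "Current_dollar_GDP_millions": float(value)})
    return long_rows


# Made-up sample with two kinds of GDP for one state.
raw_gdp = [
    {"GeoFips": "01000", "GeoName": "Alabama", "LineCode": "3",
     "Description": TARGET, "2003": "100.0", "2004": "110.0", "2005": "120.0"},
    {"GeoFips": "01000", "GeoName": "Alabama", "LineCode": "1",
     "Description": "Real GDP (millions of chained 2012 dollars)",
     "2003": "90.0", "2004": "95.0", "2005": "99.0"},
]
GDP_Cleaned = clean_gdp(raw_gdp)
```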

The resulting dataset is “GDP_Cleaned”:

Cleaned Data - Current-dollar GDP (millions of current dollars)

No missing values are present in “GDP_Cleaned”.

Final Dataset

Now that we have each data-set cleaned and wrangled in a “standardized” way we can join them by State and year.

Simply joining them, however, produces NAs. Looking at the data we understand that this happens because some data-sets also include Divisions, so that New England, Mideast, Great Lakes, Plains, Southeast, Southwest, Rocky Mountain and Far West appear among the States. We filter them out, as well as Puerto Rico, which also presents many missing values.
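
A minimal sketch of the join and of the filter, assuming every cleaned table is a list of rows keyed by State and year. It is written in plain Python for illustration; join_by_state_year and the sample rows are hypothetical.

```python
# Aggregates named in the text, plus Puerto Rico, are dropped after the join.
AGGREGATES = {"New England", "Mideast", "Great Lakes", "Plains",
              "Southeast", "Southwest", "Rocky Mountain", "Far West"}
EXCLUDED = AGGREGATES | {"Puerto Rico"}


def join_by_state_year(*tables):
    """Full join of long tables on (State, year), then drop excluded locations."""
    merged = {}
    for table in tables:
        for row in table:
            key = (row["State"], row["year"])
            merged.setdefault(key, {"State": row["State"], "year": row["year"]}).update(row)
    return [merged[key] for key in sorted(merged) if key[0] not in EXCLUDED]


# Made-up rows: the GDP table also carries an aggregate that the crime table lacks.
crime = [{"State": "Alabama", "year": 2004, "homicide": 3}]
gdp = [{"State": "Alabama", "year": 2004, "Current_dollar_GDP_millions": 110.0},
       {"State": "Far West", "year": 2004, "Current_dollar_GDP_millions": 900.0}]
project = join_by_state_year(crime, gdp)
```

Without the filter, the "Far West" row would survive the join with no crime values, which is the kind of NA described above.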

The resulting dataset is called “project”:

Final Dataset

EDA

Data Overview

In this section we carry out an exploratory data analysis using the cleaned data described in the Data part. Throughout the section we will still need some transformations of the data to ease the visualization and to understand everything more deeply.

To present crimes and race in a clearer way, we express the former “per 1000 inhabitants” and the latter as a percentage of the population. This also makes sense because different states have different areas and population sizes; keeping absolute magnitudes would probably give us a misleading picture. We do not change the variables’ names, though.
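
The two transformations amount to simple ratios, sketched here in plain Python with made-up numbers. The function names are hypothetical. Shares are kept as fractions between 0 and 1, which is how they appear in the summary tables of this section.

```python
def per_1000(count, population):
    """Number of events per 1000 inhabitants."""
    return count / population * 1000


def share(group_count, population):
    """Fraction of the population that belongs to a group (0 to 1)."""
    return group_count / population


# Made-up state: 45 homicides and 250,000 White residents out of 1,000,000 inhabitants.
homicide_rate = per_1000(45, 1_000_000)
white_share = share(250_000, 1_000_000)
```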

As you may have seen in the section on data, we end up with many features, some of which might be irrelevant or redundant. To check this we compute a simple correlation matrix; this is already a step towards selecting the most important variables for our later analysis.
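
For reference, the Pearson correlation behind such a matrix can be computed as in the sketch below. It is plain Python with made-up shares, written for illustration only; pearson and correlation_matrix are hypothetical names and not the functions used to produce the figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """columns maps a variable name to its values; returns a nested dict."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up shares for three states: where one age share is high, the other is low.
toy_columns = {"Age_0_17": [0.30, 0.25, 0.20], "Age_over85": [0.01, 0.02, 0.03]}
toy_corr = correlation_matrix(toy_columns)
```

In this toy example the two shares move in exactly opposite directions, so their correlation is at the lower bound of the coefficient's range.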

Figure: Correlations among variables

The main findings are:

  • All variables describing the population, such as population, Female and Male, as well as age, are perfectly (or at least highly) correlated. For this reason, we can select population and ignore how much of the population is female or male. Since these values are always around 50% of the population in each state, they would not be very informative anyway. Notice that White and Black African-American seem negatively correlated, as do Age_0_17 and Age_over85. These are only two examples, but the reason is straightforward: if a large share of the population is very young, a smaller share can be old.
  • White and crimes have a negative correlation, except for rapes, although for these the correlation seems small.
  • Black/African-American is positively correlated with all crimes, while Asian has only low correlations with them.
  • Mental health expenditure per capita appears positively correlated with the education of the population, while its correlation with GDP does not seem relevant. Its correlation with crimes is unclear; we investigate it further with some visualization tools.
  • GDP tends to be positively correlated with crimes, with the exception of rapes. We return to this result later.
  • It seems that a younger population (18-44) is associated with more homicides, aggravated assaults and violent crimes, while an older population (45+) appears negatively related to crimes.
  • We have considered two proxies for education so far, but they are highly correlated and it does not make sense to use both. Therefore, we decide to use perc_bscholder_25_44.
  • Let’s also consider the correlations among crimes. As we would expect, the correlation between the different crimes is positive, indicating that there is little differentiation: whenever criminality in a state is high, the level of all crimes is, more or less, high. Among them, rape seems to be the least correlated with the others.

Having observed the correlations above, we can also look into each variable.

To do so we can simply look at the summary of the data-set.

For age:

Age’s variables summary statistics
Variable      Min.     1st Qu.   Median   Mean     3rd Qu.   Max.
Age_0_17      0.1676   0.2288    0.2401   0.2394   0.2495    0.3150
Age_18_24     0.0829   0.0967    0.0999   0.1010   0.1032    0.1446
Age_25_44     0.2308   0.2542    0.2643   0.2662   0.2761    0.3680
Age_45_64     0.1144   0.2519    0.2623   0.2609   0.2714    0.3122
Age_65_84     0.0589   0.1075    0.1151   0.1140   0.1214    0.1609
Age_over85    0.0050   0.0151    0.0174   0.0177   0.0205    0.0269

The largest share of the population is between 25 and 64 years old, while the smallest is over 85.

For race:

Race’s variables summary statistics
Variable               Min.     1st Qu.   Median   Mean     3rd Qu.   Max.
White                  0.2557   0.7373    0.8339   0.8020   0.8908    0.9654
BlackAfricanAmerican   0.0041   0.0326    0.0770   0.1145   0.1566    0.5812
Asian                  0.0060   0.0138    0.0229   0.0372   0.0407    0.4083
Other_race             0.0122   0.0210    0.0273   0.0463   0.0443    0.3396

The majority of the population is White, followed by Black/African-American.

For crimes:

Crimes’ variables summary statistics
Variable             Min.     1st Qu.   Median   Mean     3rd Qu.   Max.
homicide             0.0084   0.0269    0.0454   0.0494   0.0611    0.3487
violent_crime        0.8655   2.6797    3.5792   4.0563   5.0244    15.3711
rape_legacy          0.0972   0.2576    0.3131   0.3271   0.3830    0.8914
aggravated_assault   0.5119   1.5712    2.2706   2.5683   3.3108    8.0413

Remember that crimes are expressed per 1000 inhabitants.
Homicides are the least common crime, while violent crimes and aggravated assaults affect on average about 4 and 2.6 people out of 1000, respectively.

For mental health expenditure, education, population and GDP:

Other variables summary statistics
Variable                      Min.        1st Qu.     Median      Mean        3rd Qu.     Max.
mh_exp_pc                     2.423e+01   7.144e+01   9.883e+01   1.201e+02   1.451e+02   4.099e+02
perc_bscconferred_18_24       1.939e+00   4.586e+00   5.505e+00   5.661e+00   6.397e+00   1.374e+01
perc_bscholder_25_44          1.948e+01   2.572e+01   2.987e+01   3.056e+01   3.408e+01   6.535e+01
Current_dollar_GDP_millions   2.266e+04   7.300e+04   1.738e+05   5.605e+05   3.818e+05   1.678e+07
population                    5.091e+05   1.715e+06   4.352e+06   1.173e+07   7.092e+06   3.160e+08

In this last summary table, it’s worth mentioning that:

  • The variability of mental health expenditure per capita seems high.
  • Population and GDP are not really interesting without further analysis and grouping by state or region, since the size of states can be very different, which affects these two variables.

Univariate visualizations

To present the most important data by State we created an interactive map, which you can access by clicking here; it shows the distribution of the selected variable across US states in a given year. As a preview, we report here the part of it showing population values in 2004:

State’s Population in 2004 in US

Moreover, we analyze the main variables graphically, one at a time, in order to detect outliers or interesting patterns.

We start with a time series of mental health expenditure per capita, both for the whole US and for the single regions. To do so we compute the median value in each region for every year and create a time series in R. Then we plot everything in one graph:
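
The aggregation behind this plot can be sketched as below, in plain Python rather than R and with made-up values. The function name and the Region column name are assumptions; mh_exp_pc is the variable name used in the summary table.

```python
from collections import defaultdict
from statistics import median


def median_by_region_year(rows, value="mh_exp_pc"):
    """Median of `value` across states, for each (Region, year) pair."""
    groups = defaultdict(list)
    for row in rows:
        if row[value] is not None:  # ignore missing observations
            groups[(row["Region"], row["year"])].append(row[value])
    return {key: median(values) for key, values in sorted(groups.items())}


# Made-up expenditure values for five states in one year.
toy_rows = [
    {"Region": "NE", "year": 2004, "mh_exp_pc": 100.0},
    {"Region": "NE", "year": 2004, "mh_exp_pc": 200.0},
    {"Region": "NE", "year": 2004, "mh_exp_pc": 300.0},
    {"Region": "S", "year": 2004, "mh_exp_pc": 50.0},
    {"Region": "S", "year": 2004, "mh_exp_pc": 70.0},
]
medians = median_by_region_year(toy_rows)
```

Using the median rather than the mean keeps a single high-spending state from pulling the regional series upward.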

Mental Health Expenditure (per capita)

We can see that, in general, expenditure per capita increased from 2004 to 2013, with some ups and downs throughout the period. The downward-sloping parts are especially pronounced in two regions, West and South, between 2009 and 2010/11. We do not have enough data to establish this, but a possible explanation could be the financial crisis, which had an impact on government budgeting. The largest difference between the 2004 and 2013 values is observed for North-East, while the smallest is for South, where the gap between these years is about $7. In the time series we only look at the median. It could be interesting to observe the same data through a boxplot to understand variability and outliers.

We start by looking at each region and at the US in total.

We notice that the US has low variability, but here the US data are already a total and do not reflect each state observed. For the regions we see, as before, that North-East is the one with the largest variation, and we already know from the time series that this is due to the steady increase in mh_exp per capita over the years.

The boxplots are ordered by median: North-East has the greatest median and the US median (which we can consider as the mean median across regions) is second in magnitude. Thus, it is driven significantly by the expenditure of North-East states.

South and Mid-West are the regions in which states seem to spend the least on mental health in per capita terms.

We can clearly observe some outliers, which are quite clustered. Probably each group of outliers represents one state’s observations in different years. These are not a problem for our analysis, so we continue.

The second boxplot we propose sheds light on the states within each region.

boxplot

As we expected, in regions such as South and West, where we observed outliers in the previous boxplot, there are states which appear far from the others. These are District of Columbia and Alaska. The latter belongs to the West region, but it is geographically detached from the other states of the region. District of Columbia is also a case on its own, since it is not a proper state but a federal district.

We confirm that Mid-West is the region with the least variability among its states in mental health expenditure per capita.

Demographic: Age and Race Composition of the Population

Let’s continue our univariate visualization part with the demographic variables.

We do so using barplots. Again, we group results by region, as this gives us an idea of the distribution of the population among the different areas of the US. Of course, we also continue to look at the total US. To group results by region we took median values and computed percentages of the population.

We start with a barplot for race composition of the population:

We immediately observe that the difference between total US and North-East is minimal, and that no large difference is present for any of the regions. In all of them there is a high prevalence of white people. Their percentage is highest in the Mid-West, while there is a particularly high percentage of Black/African-American population in the South.
Moreover, the “other race” group is among the smallest everywhere except in the West, where the Black/African-American percentage is lower than that of both Asian and other races.

Now on age composition:

The overall picture for the US is the same as in the summary table of the data overview section. What is new is that we can compare the “age” of each region, although the composition of the population does not change in a relevant way across regions.

Demographic: Education

Again, we group results by region and take median values of the percentage of bachelor’s degree holders aged 25-44. We can notice from the following graph that, using our proxy for education, the South has a lower percentage of bachelor’s holders. North-East, instead, seems to have 6% more educated people than the mean value of the US.

We also look at a boxplot to understand the variability of education inside each region. The variability is not too high, although we observe some outliers in the South; again, we think they are due to District of Columbia:

Criminality Distribution Across States

Again, we group results by region, take median values and express them per 1000 inhabitants. Finally, we look at the distribution of crimes in the US.

South and West have the highest level of criminality, departing markedly from the other regions for violent crimes and aggravated assaults. Violent crimes seem to be the most common crime, while homicide is the least frequent and is lowest in the Mid-West.

Multivariate visualizations

Now that we have discussed the variables one by one, we can look at the relationships between several variables at the same time. Notice that, when appropriate, we use a log10 scale. This is useful for some of our variables because they cover a large range of values. We also decide to remove District of Columbia and United States: the former creates outliers in most cases and is not a proper state, and the latter is just a total observation.

Since the correlation between mental health expenditure and criminality appeared unclear in the corrplot in the first part of the EDA section, we start by investigating this relationship through a scatterplot. We plot mental health expenditure per capita against the various kinds of criminality: homicides, violent crime, rape and aggravated assault.

From the scatterplot above we can see that the overall correlation is slightly negative: higher public mental health spending per capita is associated, on average, with lower criminality.

Remember that from the corrplot we identified a positive correlation between education and mental health expenditure per capita: the higher the education, the higher the spending on mental health. We show this through a scatterplot, and the outcome is exactly what we expected from the corrplot, even if we do not exclude District of Columbia.

The second scatterplot we propose is criminality against education.

Here we can see two distinct things. First, the correlation is negative: on average, the higher the education, the lower the criminality rate. Second, consider log GDP. For all crimes except rape, the lighter dots (higher GDP) lie above the trend line, while the darker dots (lower GDP) lie below. Therefore, there is a positive correlation between GDP and the kinds of crime we considered, except for rape, which has a negative correlation.

Now we want to look at a few of the correlations we saw before, but in the time dimension.

First, mental health expenditure and criminality over time. We report here the time series for only one crime, homicide, since the patterns are similar for all four of them:

From this time series we can see that in the US the total number of homicides decreases over time. This may be related to the increase in mental health expenditure.

Now we compare mental health spending with education over time.

Here we see that the increase in education over the selected decade also corresponds to an increase in public expenditure on mental health.

Finally, we compare homicides with education. Again, the pattern is similar for the other types of crime, so we report only the one for homicides:

This final time series shows that in the decade of interest the decrease in homicides also corresponds to an increase in education.

However, to better understand all of these effects and draw stronger conclusions we need some panel data analysis on the data-set, which we carry out in the next section.

Analysis

Analysis through Regressions: Methodology, Selection and Justification

To further study the impact of the different factors on criminality we use econometric regressions.

We start with a standard OLS model, where we consider total criminality per 1000 people, defined as the sum of rape, homicide, violent crime and aggravated assault, all in per 1000 terms. The OLS model is:

\[
\begin{aligned}
\text{Criminality per 1000 inhabitants} = {} & \alpha + \beta_1 \log(GDP) + \beta_2 \log(mh\_exp\_pc) + \beta_3\, perc\_bscholder\_25\_44 \\
& + \beta_4\, White + \beta_5\, BlackAfricanAmerican + \beta_6\, Asian \\
& + \beta_7\, Age\_0\_17 + \beta_8\, Age\_18\_24 + \beta_9\, Age\_25\_44 \\
& + \beta_{10}\, Age\_45\_64 + \beta_{11}\, Age\_65\_84 + \beta_{12} \log(population) + \varepsilon
\end{aligned}
\]

Running the regression we obtain the following coefficients’ estimates:

Standard OLS, dependent variable: Total Criminality per 1000 inhabitants

                                    total_criminality
Predictors                          Estimates   CI                   p
(Intercept)                         -146.90     -254.92 – -38.88     0.008
Current_dollar_GDP_millions [log]   4.73        3.52 – 5.94          <0.001
mh_exp_pc [log]                     -0.30       -0.73 – 0.12         0.162
perc_bscholder_25_44                -0.09       -0.15 – -0.03        0.002
White                               -30.76      -36.96 – -24.57      <0.001
BlackAfricanAmerican                -19.52      -25.57 – -13.48      <0.001
Asian                               -58.17      -69.43 – -46.90      <0.001
Age_0_17                            158.71      50.56 – 266.85       0.004
Age_18_24                           197.17      78.31 – 316.04       0.001
Age_25_44                           241.96      142.86 – 341.06      <0.001
Age_45_64                           161.16      55.17 – 267.14       0.003
Age_65_84                           255.53      128.73 – 382.32      <0.001
population [log]                    -4.20       -5.41 – -2.98        <0.001
Observations                        505
R2 / R2 adjusted                    0.648 / 0.639

We notice that GDP, mh_exp_pc and the education proxy have coefficients with the signs we expected from the EDA. Indeed, GDP is associated with higher criminality, while mental health expenditure and education seem to decrease it. Among them, however, only \(log(GDP)\) and education are statistically significant. Surprisingly, all race shares have a negative coefficient; this is not a convincing result, since the correlation of criminality with Black/African-American seemed positive in the corrplot in the EDA section. Looking at the table, we see that all age groups, as shares of the population, are significant. However, since all these coefficients are positive, we think there could be some mis-specification leading to biased estimators. In general, we do not think this regression is informative for us, since it does not account for characteristics specific to the state and the year. Indeed, using a standard OLS we ignore the fact that our data-set is a panel.

Therefore, we declare our data-frame as panel data and compute regressions with fixed effects, random effects and first differences. Before proceeding we briefly explain each of them:

  • Fixed Effect: using a “within” method allows us to control for variables which remain constant over time. In our case, it absorbs any time-invariant effect on criminality of being a particular state in the US.
  • Random Effect: in contrast to the method above, the state-specific effects are treated as random, i.e. as unpredictable components that are uncorrelated with the regressors.
  • First Difference: this method is used to deal with the omitted variable problem in panel data and it is consistent under the same assumptions as the fixed effect method. Like the fixed effect method, it accounts for effects which are constant over time; indeed, with T=2 the two give the same result.
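
The equivalence of the two methods with T=2 can be checked on a toy panel. The sketch below is plain Python with made-up data and a single regressor; it assumes no intercept and no time dummies in either regression, and the function names are hypothetical. It is not the estimation code used for the results in this report.

```python
def slope(xs, ys):
    """OLS slope of a regression through the origin: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def within_slope(panel):
    """Fixed effect estimate: subtract each unit's own mean, pool, regress."""
    xs, ys = [], []
    for obs in panel.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        xs += [x - mean_x for x, _ in obs]
        ys += [y - mean_y for _, y in obs]
    return slope(xs, ys)


def first_difference_slope(panel):
    """First difference estimate: subtract the previous period, pool, regress."""
    xs, ys = [], []
    for obs in panel.values():
        for (x0, y0), (x1, y1) in zip(obs, obs[1:]):
            xs.append(x1 - x0)
            ys.append(y1 - y0)
    return slope(xs, ys)


# Made-up panel with T=2: each unit has its own level, the common slope is about 2.
toy_panel = {
    "A": [(1.0, 12.1), (2.0, 13.9)],
    "B": [(1.0, 52.0), (3.0, 56.2)],
    "C": [(2.0, -5.9), (5.0, 0.2)],
}
fe_estimate = within_slope(toy_panel)
fd_estimate = first_difference_slope(toy_panel)
```

Both transformations remove the unit-specific level (about 10, 50 and -10 in the toy data), so neither estimate is distorted by it; with more than two periods the two estimates generally differ.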

We try to run all regression, but after some consideration we think the most appropriate for our case is fixed effect method and the reasons are:

  • A Hausman test between the fixed effect and the random effect regressions leads us to select the former. The random effect method relies on stronger assumptions than the fixed effect method (notably that the unobserved heterogeneity is uncorrelated with the independent variables). If these assumptions hold, the random effect estimates are more efficient and should be preferred; if they do not hold, the random effect results are wrong. We ran the Hausman test for many regressions (with each crime separately, total crimes, and total crimes minus rape as dependent variable) and in every case it favors the fixed effect regression. For example, with total criminality as dependent variable we obtain a low p-value, so we can reject at the 1% significance level the hypothesis that the two regressions give the same results. Thus, the random effect method would give us biased results. (Notice that a Random Effects regression can also include control variables that are constant over time; indeed, we also include region when trying the RE regression.)
  • Both fixed effect and first difference remove time-invariant unobserved variables, which lets us deal with possible omitted variables that are constant over time. The first difference method removes the unobserved heterogeneity by subtracting the lagged observation rather than the group mean used by the fixed effect method. First differencing is usually suggested when the number of units N is small and the observations cover a long time frame (i.e. T is large). In our case, however, we only have T=10, while we have 51 units (52 if we also count the United States total). For this reason we decide to use Fixed Effect.
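
For reference, the Hausman statistic is usually written in the following textbook form. The symbols \(\hat{\beta}_{FE}\), \(\hat{\beta}_{RE}\), \(\hat{V}_{FE}\), \(\hat{V}_{RE}\) and \(k\) are notation introduced here for illustration and do not come from our regression output:

\[ H = (\hat{\beta}_{FE}-\hat{\beta}_{RE})'\,\big[\hat{V}_{FE}-\hat{V}_{RE}\big]^{-1}\,(\hat{\beta}_{FE}-\hat{\beta}_{RE}) \]

Here \(\hat{V}_{FE}\) and \(\hat{V}_{RE}\) are the estimated covariance matrices of the two coefficient vectors, restricted to the \(k\) time-varying regressors both models share. Under the null hypothesis that the random effect assumptions hold, \(H\) is approximately \(\chi^2\) distributed with \(k\) degrees of freedom, so a large \(H\) (a low p-value) favors the fixed effect regression.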

An additional consideration is whether or not to use clustered standard errors. Their advantage is that they account for within-cluster correlation and heteroskedasticity, which the fixed-effects estimator alone does not take into account. Notice that clustering adjusts the standard errors but leaves the point estimates unchanged. In our case, though, the results do not change in a relevant way whether we use cluster-adjusted standard errors or not.
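
To illustrate why clustering changes only the standard errors, here is a minimal Python sketch of a one-regressor fixed effect estimate with both conventional and cluster-robust standard errors. It is an illustration under stated assumptions, not the code used for this report: the function name and the simulated panel are hypothetical, and the cluster-robust formula is the plain sandwich form without any small-sample correction.

```python
import math
import random

def fe_with_clustered_se(panel):
    """panel: dict unit -> list of (x, y). Returns (beta, conventional_se, clustered_se)."""
    # Within transformation: subtract each unit's own mean from x and y.
    demeaned = {}
    for unit, obs in panel.items():
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        demeaned[unit] = [(x - x_bar, y - y_bar) for x, y in obs]

    sxx = sum(x * x for obs in demeaned.values() for x, _ in obs)
    sxy = sum(x * y for obs in demeaned.values() for x, y in obs)
    beta = sxy / sxx  # point estimate: the same whichever standard error is reported

    n_units = len(demeaned)
    n_obs = sum(len(obs) for obs in demeaned.values())
    resid = {u: [y - beta * x for x, y in obs] for u, obs in demeaned.items()}

    # Conventional variance: assumes homoskedastic, uncorrelated errors.
    ssr = sum(e * e for es in resid.values() for e in es)
    sigma2 = ssr / (n_obs - n_units - 1)
    se_conventional = math.sqrt(sigma2 / sxx)

    # Cluster-robust variance: sum the scores x*e within each unit before squaring,
    # so correlation between periods of the same unit is kept.
    meat = 0.0
    for u, obs in demeaned.items():
        score = sum(x * e for (x, _), e in zip(obs, resid[u]))
        meat += score ** 2
    se_clustered = math.sqrt(meat) / sxx
    return beta, se_conventional, se_clustered

# Hypothetical panel: 50 units, 10 periods, true slope 1.5, errors trending within unit.
random.seed(1)
panel = {}
for unit in range(50):
    alpha_i = random.gauss(0, 5)
    slope_x = random.gauss(0, 1)      # unit-specific trend in x
    slope_e = random.gauss(0, 0.2)    # unit-specific trend in the error (serial correlation)
    obs = []
    for t in range(10):
        x = alpha_i + slope_x * t + random.gauss(0, 1)
        e = slope_e * t + random.gauss(0, 0.5)
        obs.append((x, alpha_i + 1.5 * x + e))
    panel[unit] = obs

beta, se_conv, se_clu = fe_with_clustered_se(panel)
print(beta, se_conv, se_clu)
```

The function computes a single `beta` and two standard errors for it, which mirrors the point made above: the choice of standard error affects inference, not the coefficient.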

We would also like to point out another thought we had while running the regressions. In the EDA part we saw that Rape seems to be the only crime, among those we consider, that behaves differently and is influenced differently by GDP and, slightly, by the other variables. For this reason we ran different regressions, using as dependent variable (per 1000 inhabitants):

  • a group of all crimes but rape
  • all crimes
  • each single crime on its own

In all the regressions we exclude the United States aggregate, since it would be redundant, being the total of the other states.

Answers to the research questions

We report here the results that we consider worth mentioning. As said above, we select the fixed effect method. The model is: \[ \begin{align*} Y_{i,t} &=\alpha+\beta_1log(GDP)+\beta_2mh\_exp\_pc+\beta_3perc\_bscholder\_25\_44+\\ & \,\,+\beta_4 White+\beta_5 BlackAfricanAmerican+\beta_6Asian + \\ & \,\, + \beta_7Age\_0\_17 + \beta_8Age\_18\_24+ \beta_9Age\_25\_44+ \\ & \,\,+\beta_{10}Age\_45\_64+ \beta_{11}Age\_65\_84+\beta_{12}log(population) \end{align*} \] \(Y_{i,t}\) refers to the dependent variable for state \(i\) at time \(t\). The estimation is carried out on \(Y_{i,t}-\bar{Y_i}\), where \(\bar{Y_i}\) is the mean of the dependent variable for state \(i\) over time. As a result, \(\alpha\) does not appear in the results, since it is constant over time.
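
Written out, the within transformation amounts to the following. This is a sketch in generic notation: \(X_{k,i,t}\) stands for the \(k\)-th of the twelve regressors listed above and \(\varepsilon_{i,t}\) for an error term, neither of which appears explicitly in the model as written.

\[ Y_{i,t}-\bar{Y_i} = \sum_{k=1}^{12}\beta_k\,(X_{k,i,t}-\bar{X}_{k,i}) + (\varepsilon_{i,t}-\bar{\varepsilon}_i) \]

Any term that is constant over time for a state equals its own state mean and therefore drops out of the demeaned equation.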

For total criminality, the regression results are:

Fixed Effect, dependent variable: Total Criminality per 1000 inhabitants
  total_criminality
Predictors                         Estimates  CI                 p
Current_dollar_GDP_millions [log]  3.88       2.74 – 5.01        <0.001
mh_exp_pc [log]                    0.16       -0.20 – 0.51       0.392
perc_bscholder_25_44               -0.05      -0.13 – 0.02       0.149
White                              -33.85     -81.35 – 13.64     0.163
BlackAfricanAmerican               -3.33      -53.89 – 47.23     0.897
Asian                              -75.27     -131.71 – -18.83   0.009
Age_0_17                           268.59     87.68 – 449.50     0.004
Age_18_24                          242.92     42.55 – 443.30     0.018
Age_25_44                          260.26     69.73 – 450.78     0.008
Age_45_64                          267.36     70.77 – 463.94     0.008
Age_65_84                          246.68     50.23 – 443.12     0.014
population [log]                   -15.85     -20.10 – -11.59    <0.001
Observations                       505
R2 / R2 adjusted                   0.471 / 0.397

Notice that the \(R^2\), a statistical measure of the proportion of the variance of the dependent variable that is explained by the independent variables in a regression model, is lower here than in the standard OLS. Compared with the standard OLS estimates, the magnitudes change but the signs do not. The only exception is mental health expenditure which, here, appears to have a positive effect on criminality. However, mh_exp_pc and the education proxy are no longer statistically significant. Additionally, among races, only the percentage of Asians in the population is statistically significant, and it still influences criminality negatively. As in the OLS estimates, \(log(population)\) decreases criminality: since the dependent variable is in levels and population is in logs, a 1% increase in population is associated with a decrease of about \(15.85/100 \approx 0.16\) crimes per 1000 inhabitants.
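
Because GDP and population enter in logs while the dependent variable is in levels, a coefficient \(\beta\) translates into a change of \(\beta \cdot log(1+p/100)\) in the dependent variable for a \(p\)% change in the regressor, which is roughly \(\beta/100\) for a 1% change. The short Python sketch below applies this to the two logged coefficients in the table above. The helper name is hypothetical and natural logarithms are assumed.

```python
import math

def level_log_effect(beta, pct_change):
    """Change in a level outcome when a logged regressor rises by pct_change percent.

    Assumes the regressor was transformed with the natural logarithm.
    """
    return beta * math.log(1 + pct_change / 100)

# Coefficients taken from the fixed effect table for total criminality.
effect_population = level_log_effect(-15.85, 1)   # 1% more population
effect_gdp = level_log_effect(3.88, 1)            # 1% more GDP

print(round(effect_population, 3), round(effect_gdp, 3))
```

Both effects are a small fraction of one crime per 1000 inhabitants, so the size of the raw coefficients should not be read as the effect of a 1% change.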

Among the various regressions we ran, only those with rape and homicide as dependent variables give results that differ from the ones presented above.

For Rape:

Fixed Effect, dependent variable: Rapes per 1000 inhabitants
  rape_legacy
Predictors                         Estimates  CI                 p
Current_dollar_GDP_millions [log]  0.0762     0.0087 – 0.1438    0.027
mh_exp_pc [log]                    0.0131     -0.0081 – 0.0343   0.226
perc_bscholder_25_44               -0.0001    -0.0044 – 0.0043   0.976
White                              1.2084     -1.6239 – 4.0407   0.403
BlackAfricanAmerican               1.5597     -1.4553 – 4.5746   0.311
Asian                              -0.2739    -3.6396 – 3.0918   0.873
Age_0_17                           2.8333     -7.9549 – 13.6216  0.607
Age_18_24                          3.3037     -8.6452 – 15.2527  0.588
Age_25_44                          5.9642     -5.3975 – 17.3260  0.304
Age_45_64                          4.9356     -6.7876 – 16.6588  0.410
Age_65_84                          3.1876     -8.5273 – 14.9025  0.594
population [log]                   -0.3180    -0.5715 – -0.0644  0.014
Observations                       505
R2 / R2 adjusted                   0.216 / 0.106

From the FE regression with Rape per 1000 inhabitants as dependent variable we learn that:

  • Rape appears to be less affected, in magnitude, by GDP and population than total criminality, but the effects are still positive and negative respectively,
  • \(R^2\) is very low, so the regression probably explains little of the variance in Rape in a state in a given year,
  • Only \(log(GDP)\) and \(log(Population)\) are statistically significant; the other variables are not.

For Homicides:

Fixed Effect, dependent variable: Homicides per 1000 inhabitants
  homicide
Predictors                         Estimates  CI                 p
Current_dollar_GDP_millions [log]  0.051      0.037 – 0.065      <0.001
mh_exp_pc [log]                    0.004      -0.001 – 0.008     0.084
perc_bscholder_25_44               -0.001     -0.002 – -0.000    0.004
White                              -0.061     -0.661 – 0.539     0.842
BlackAfricanAmerican               1.212      0.574 – 1.851      <0.001
Asian                              -0.820     -1.533 – -0.107    0.025
Age_0_17                           2.520      0.235 – 4.805      0.031
Age_18_24                          1.483      -1.048 – 4.014     0.251
Age_25_44                          0.931      -1.475 – 3.338     0.448
Age_45_64                          1.394      -1.088 – 3.877     0.272
Age_65_84                          2.087      -0.394 – 4.568     0.100
population [log]                   -0.209     -0.262 – -0.155    <0.001
Observations                       505
R2 / R2 adjusted                   0.683 / 0.638

From the FE regression with Homicides per 1000 inhabitants as dependent variable we learn that:

  • Homicide appears to be less affected, in magnitude, by GDP and population than total criminality, but the effects are still positive and negative respectively,
  • \(R^2=0.683\), so the variables in the regression explain 68% of the variance in homicides per 1000 people in a state in a given year,
  • Black African American and Asian percentages in the population have, respectively, a statistically significant positive and negative effect on homicides per 1000 inhabitants,
  • A high percentage of very young population appears to increase homicides at the 5% significance level, but we find this difficult to explain through social mechanisms in a community.

Is there any relationship between expenditure for mental health by the government and criminality?

The answer is inconclusive. The corrplot shows slightly positive correlations between mental health expenditure and crimes (the only exception being Rape), but in the regressions the coefficient is not statistically significant. On the other hand, the relationship between mental health expenditure and crimes appears negative in the scatterplot and the time series shown in the sections above.

Is the level of education and wealth (through GDP) of a State relevant for its level of criminality?

For GDP we can say that:

  • Its relationship with criminality is consistent throughout our analysis: our study finds a positive effect of GDP on criminality.
  • For Rape the answer is less clear. From the regression we learn that the impact is positive but much lower than for total criminality. Instead, from the following scatterplot we would say that as GDP increases, Rape decreases.
  • Looking at the Fixed Effect (FE) estimate for total criminality, and recalling that GDP enters in logs, we can interpret the coefficient as follows: if GDP in millions of dollars increases by 1%, the number of crimes in a state in a given year increases by about \(3.88/100 \approx 0.04\) per 1000 inhabitants. This is statistically significant.

For Education we can say that:

  • Its relationship with criminality is negative: the higher the percentage of the population aged 25 to 44 holding a Bachelor’s Degree, the lower the number of crimes per 1000 people. This was confirmed by scatterplots, corrplot, time series plot and regressions.
  • The FE regressions report it to be significant only for homicides, and even there the magnitude of the effect is very small.
  • The OLS estimate can be interpreted as follows: for an increase of 1 percentage point in the percentage of the population aged 25 to 44 holding a Bachelor’s Degree, total criminality would decrease by 0.09 per 1000 inhabitants. This does not seem a big number.
  • Looking at the barplot presented in the univariate visualization section, we can also note that the North-East and Mid-West regions have the highest percentage of educated population and a lower incidence of crimes than the South and West.

Is the composition of the population, in terms of both age and ethnicity, relevant for criminality in the area?

  • The age structure of the population does not vary significantly across states and regions, so our study cannot say much about it. The only evidence on the age distribution comes from the corrplot: a younger population (18-44) is associated with more homicides, aggravated assaults and violent crimes, while an older population (45+) appears negatively related to crimes. The regression output, however, is inconclusive, since the estimates are all positive and large in magnitude.

  • The racial composition of the population could play a role. Indeed, the South region of the US has the highest percentage of Black-African Americans and the highest incidence of crimes, supporting the positive correlation found in the corrplot between all kinds of crimes and Black-African American. White population is positively correlated with rape. In the regressions, however, the coefficients for all races are negative when looking at total criminality. For homicides, the significant race estimates are those for Black-African American (a 1 percentage point increase in the Black-African American population leads to 1 more homicide per 1000 inhabitants) and Asian (a 1 percentage point increase in the Asian population leads to 0.8 fewer homicides per 1000 inhabitants).

Is mental health expenditure affected by how much the population is educated or by GDP of the country?

Looking at the correlations and the time series reported in the previous sections, we would answer yes. There is a positive relationship between the two variables: the more educated the population, the higher the expenditure on mental health in the state. We can also represent this finding in the following scatterplot with the linear regression.

Conclusions

Take home message

Up to now we could only try to guess why such correlations exist and which social factors produce them.

Our interpretations of the correlations are the following:

  • We have found three effects that we want to discuss one last time. First, there is a negative correlation between mental health spending and violent crimes. We think this correlation is reasonable, since taking care of potentially dangerous people could reduce criminality, or at least reduce relapse.
  • There is a negative correlation between education and criminality. This also seems reasonable, since education not only builds hard skills but also teaches people how to live in a civilized society, creates networks (making it easier to get help) and raises awareness of social problems.
  • Finally, there is a positive correlation between GDP and crimes, except for rape. At first glance we found this counterintuitive, because we expected a wealthier state to have less criminality than a poorer one. However, after thinking about the possible social reasons behind it, we concluded that it might be caused by the social distance between poor and rich people. In a wealthier state the gap between the wealthier and the poorer might be large, which might induce more people to commit violent crimes to gain money or reduce debt. This might also explain why the pattern differs for rape: of the four we considered, rape is the crime least related to a possible change in the wealth of the individual who commits it. These are only suppositions, however, so we looked into existing research and found a vast literature on how higher GDP has, on average, a positive effect on criminality. It also emerges that simultaneous causality may need to be considered, since GDP, education, unemployment and poverty are strictly linked factors.
    Some references are: Northrup and Klaer, Effect of GDP on Violent Crime; Mulok, Kogid, Lily and Asid (2016), The Relationship between Crime and Economic Growth in Malaysia: Re-Examine Using Bound Test Approach.

Limitations

  • Control variables: some relevant variables may be omitted.
  • Method: we do not know any other algorithm.
  • Sample: all units are US states; the sample could be increased.

Future work?

  • implement other modelling methodologies
  • include other control variables, such as the number of police officers, the number of protests and a wealth inequality index